Goto

Collaborating Authors

 fairness gap



Investigating Label Bias and Representational Sources of Age-Related Disparities in Medical Segmentation

Parikh, Aditya, Das, Sneha, Feragen, Aasa

arXiv.org Artificial Intelligence

Algorithmic bias in medical imaging can perpetuate health disparities, yet its causes remain poorly understood in segmentation tasks. While fairness has been extensively studied in classification, segmentation remains underexplored despite its clinical importance. In breast cancer segmentation, models exhibit significant performance disparities against younger patients, commonly attributed to physiological differences in breast density. We audit the MAMA-MIA dataset, establishing a quantitative baseline of age-related bias in its automated labels, and reveal a critical Biased Ruler effect where systematically flawed labels for validation misrepresent a model's actual bias. However, whether this bias originates from lower-quality annotations (label bias) or from fundamentally more challenging image characteristics remains unclear. Through controlled experiments, we systematically refute hypotheses that the bias stems from label quality sensitivity or quantitative case difficulty imbalance. Balancing training data by difficulty fails to mitigate the disparity, revealing that younger patient cases are intrinsically harder to learn. We provide direct evidence that systemic bias is learned and amplified when training on biased, machine-generated labels, a critical finding for automated annotation pipelines. This work introduces a systematic framework for diagnosing algorithmic bias in medical segmentation and demonstrates that achieving fairness requires addressing qualitative distributional differences rather than merely balancing case counts.


Doubly-Regressing Approach for Subgroup Fairness

Lee, Kyungseon, Kim, Kunwoong, Lee, Jihu, Yang, Dongyoon, Kim, Yongdai

arXiv.org Machine Learning

Algorithmic fairness is a socially crucial topic in real-world applications of AI. Among many notions of fairness, subgroup fairness is widely studied when multiple sensitive attributes (e.g., gender, race, age) are present. However, as the number of sensitive attributes grows, the number of subgroups increases accordingly, creating heavy computational burdens and data sparsity problem (subgroups with too small sizes). In this paper, we develop a novel learning algorithm for subgroup fairness which resolves these issues by focusing on subgroups with sufficient sample sizes as well as marginal fairness (fairness for each sensitive attribute). To this end, we formalize a notion of subgroup-subset fairness and introduce a corresponding distributional fairness measure called the supremum Integral Probability Metric (supIPM). Building on this formulation, we propose the Doubly Regressing Adversarial learning for subgroup Fairness (DRAF) algorithm, which reduces a surrogate fairness gap for supIPM with much less computation than directly reducing supIPM. Theoretically, we prove that the proposed surrogate fairness gap is an upper bound of supIPM. Empirically, we show that the DRAF algorithm outperforms baseline methods in benchmark datasets, specifically when the number of sensitive attributes is large so that many subgroups are very small.


FairTune: A Bias-Aware Fine-Tuning Framework Towards Fair Heart Rate Prediction from PPG

Panchumarthi, Lovely Yeswanth, Kataria, Saurabh, Wu, Yi, Hu, Xiao, Fedorov, Alex, Kwak, Hyunjung Gloria

arXiv.org Artificial Intelligence

Foundation models pretrained on physiological data such as photoplethysmography (PPG) signals are increasingly used to improve heart rate (HR) prediction across diverse settings. Fine-tuning these models for local deployment is often seen as a practical and scalable strategy. However, its impact on demographic fairness--particularly under domain shifts--remains underexplored. We fine-tune PPG-GPT -- a transformer-based foundation model pretrained on intensive care unit (ICU) data -- across three heterogeneous datasets (ICU, wearable, smartphone) and systematically evaluate the effects on HR prediction accuracy and gender fairness. While fine-tuning substantially reduces mean absolute error (up to 80%), it can simultaneously widen fairness gaps, especially in larger models and under significant distributional characteristics shifts. To address this, we introduce FairTune, a bias-aware fine-tuning framework in which we benchmark three mitigation strategies: class weighting based on inverse group frequency (IF), Group Distributionally Robust Optimization (GroupDRO), and adversarial debiasing (ADV). We find that IF and GroupDRO significantly reduce fairness gaps without compromising accuracy, with effectiveness varying by deployment domain. Representation analyses further reveal that mitigation techniques reshape internal embeddings to reduce demographic clustering. Our findings highlight that fairness does not emerge as a natural byproduct of fine-tuning and that explicit mitigation is essential for equitable deployment of physiological foundation models.


Audits Under Resource, Data, and Access Constraints: Scaling Laws For Less Discriminatory Alternatives

Cen, Sarah H., Goyal, Salil, Javed, Zaynah, Karthik, Ananya, Liang, Percy, Ho, Daniel E.

arXiv.org Machine Learning

AI audits play a critical role in AI accountability and safety. One branch of the law for which AI audits are particularly salient is anti-discrimination law. Several areas of anti-discrimination law implicate the "less discriminatory alternative" (LDA) requirement, in which a protocol (e.g., model) is defensible if no less discriminatory protocol that achieves comparable performance can be found with a reasonable amount of effort. Notably, the burden of proving an LDA exists typically falls on the claimant (the party alleging discrimination). This creates a significant hurdle in AI cases, as the claimant would seemingly need to train a less discriminatory yet high-performing model, a task requiring resources and expertise beyond most litigants. Moreover, developers often shield information about and access to their model and training data as trade secrets, making it difficult to reproduce a similar model from scratch. In this work, we present a procedure enabling claimants to determine if an LDA exists, even when they have limited compute, data, information, and model access. We focus on the setting in which fairness is given by demographic parity and performance by binary cross-entropy loss. As our main result, we provide a novel closed-form upper bound for the loss-fairness Pareto frontier (PF). We show how the claimant can use it to fit a PF in the "low-resource regime," then extrapolate the PF that applies to the (large) model being contested, all without training a single large model. The expression thus serves as a scaling law for loss-fairness PFs. To use this scaling law, the claimant would require a small subsample of the train/test data. Then, the claimant can fit the context-specific PF by training as few as 7 (small) models. We stress test our main result in simulations, finding that our scaling law holds even when the exact conditions of our theory do not.


Improving Performance, Robustness, and Fairness of Radiographic AI Models with Finely-Controllable Synthetic Data

Moroianu, Stefania L., Bluethgen, Christian, Chambon, Pierre, Cherti, Mehdi, Delbrouck, Jean-Benoit, Paschali, Magdalini, Price, Brandon, Gichoya, Judy, Jitsev, Jenia, Langlotz, Curtis P., Chaudhari, Akshay S.

arXiv.org Artificial Intelligence

Achieving robust performance and fairness across diverse patient populations remains a challenge in developing clinically deployable deep learning models for diagnostic imaging. Synthetic data generation has emerged as a promising strategy to address limitations in dataset scale and diversity. We introduce RoentGen-v2, a text-to-image diffusion model for chest radiographs that enables fine-grained control over both radiographic findings and patient demographic attributes, including sex, age, and race/ethnicity. RoentGen-v2 is the first model to generate clinically plausible images with demographic conditioning, facilitating the creation of a large, demographically balanced synthetic dataset comprising over 565,000 images. We use this large synthetic dataset to evaluate optimal training pipelines for downstream disease classification models. In contrast to prior work that combines real and synthetic data naively, we propose an improved training strategy that leverages synthetic data for supervised pretraining, followed by fine-tuning on real data. Through extensive evaluation on over 137,000 chest radiographs from five institutions, we demonstrate that synthetic pretraining consistently improves model performance, generalization to out-of-distribution settings, and fairness across demographic subgroups. Across datasets, synthetic pretraining led to a 6.5% accuracy increase in the performance of downstream classification models, compared to a modest 2.7% increase when naively combining real and synthetic data. We observe this performance improvement simultaneously with the reduction of the underdiagnosis fairness gap by 19.3%. These results highlight the potential of synthetic imaging to advance equitable and generalizable medical deep learning under real-world data constraints. We open source our code, trained models, and synthetic dataset at https://github.com/StanfordMIMI/RoentGen-v2 .



Competing Risks: Impact on Risk Estimation and Algorithmic Fairness

Jeanselme, Vincent, Tom, Brian, Barrett, Jessica

arXiv.org Artificial Intelligence

Accurate time-to-event prediction is integral to decision-making, informing medical guidelines, hiring decisions, and resource allocation. Survival analysis, the quantitative framework used to model time-to-event data, accounts for patients who do not experience the event of interest during the study period, known as censored patients. However, many patients experience events that prevent the observation of the outcome of interest. These competing risks are often treated as censoring, a practice frequently overlooked due to a limited understanding of its consequences. Our work theoretically demonstrates why treating competing risks as censoring introduces substantial bias in survival estimates, leading to systematic overestimation of risk and, critically, amplifying disparities. First, we formalize the problem of misclassifying competing risks as censoring and quantify the resulting error in survival estimates. Specifically, we develop a framework to estimate this error and demonstrate the associated implications for predictive performance and algorithmic fairness. Furthermore, we examine how differing risk profiles across demographic groups lead to group-specific errors, potentially exacerbating existing disparities. Our findings, supported by an empirical analysis of cardiovascular management, demonstrate that ignoring competing risks disproportionately impacts the individuals most at risk of these events, potentially accentuating inequity. By quantifying the error and highlighting the fairness implications of the common practice of considering competing risks as censoring, our work provides a critical insight into the development of survival models: practitioners must account for competing risks to improve accuracy, reduce disparities in risk assessment, and better inform downstream decisions.


On the Interplay of Human-AI Alignment,Fairness, and Performance Trade-offs in Medical Imaging

Luo, Haozhe, Zhou, Ziyu, Shu, Zixin, de Mortanges, Aurélie Pahud, Berke, Robert, Reyes, Mauricio

arXiv.org Artificial Intelligence

Deep neural networks excel in medical imaging but remain prone to biases, leading to fairness gaps across demographic groups. We provide the first systematic exploration of Human-AI alignment and fairness in this domain. Our results show that incorporating human insights consistently reduces fairness gaps and enhances out-of-domain generalization, though excessive alignment can introduce performance trade-offs, emphasizing the need for calibrated strategies. These findings highlight Human-AI alignment as a promising approach for developing fair, robust, and generalizable medical AI systems, striking a balance between expert guidance and automated efficiency. Our code is available at https://github.com/Roypic/Aligner.


Group Fairness in Multi-Task Reinforcement Learning

Song, Kefan, Jiang, Runnan, Chandra, Rohan, Zhang, Shangtong

arXiv.org Artificial Intelligence

This paper addresses a critical societal consideration in the application of Reinforcement Learning (RL): ensuring equitable outcomes across different demographic groups in multi-task settings. While previous work has explored fairness in single-task RL, many real-world applications are multi-task in nature and require policies to maintain fairness across all tasks. We introduce a novel formulation of multi-task group fairness in RL and propose a constrained optimization algorithm that explicitly enforces fairness constraints across multiple tasks simultaneously. We have shown that our proposed algorithm does not violate fairness constraints with high probability and with sublinear regret in the finite-horizon episodic setting. Through experiments in RiverSwim and MuJoCo environments, we demonstrate that our approach better ensures group fairness across multiple tasks compared to previous methods that lack explicit multi-task fairness constraints in both the finite-horizon setting and the infinite-horizon setting. Our results show that the proposed algorithm achieves smaller fairness gaps while maintaining comparable returns across different demographic groups and tasks, suggesting its potential for addressing fairness concerns in real-world multi-task RL applications.